To make the quality rating more readable, I grouped the quality scores into three levels: ‘poor’ for scores below 5, ‘average’ for scores of 5 or 6, and ‘high’ for scores of 7 or above.
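The code that builds ‘quality_lv’ is not shown in this report. A minimal sketch of one way to do it, assuming base R's cut() and a toy vector of scores (the toy values are illustrative, not the wine data):

```r
# Toy scores covering the observed range 3..8 (illustrative, not the wine data).
quality <- c(3, 4, 5, 6, 7, 8)

# cut() uses right-closed intervals by default, so the breaks below give
# (-Inf, 4] -> poor, (4, 6] -> average, (6, Inf) -> high.
quality_lv <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("poor", "average", "high"),
  ordered_result = TRUE   # ordered factor: poor < average < high
)

print(data.frame(quality, quality_lv))
```

Applied to the real data, the same call would be written as WINE$quality_lv <- cut(WINE$quality, ...).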
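library(ggplot2)  # plotting
library(dplyr)    # group_by(), summarize() and %>% used further down
# Histogram of one WINE column: variable1 is the column name, variable2 the bin width.
uni_qplot <- function(variable1, variable2){
  ggplot(data=WINE,aes_q(as.name(variable1)))+
    geom_histogram(binwidth=variable2)+
    ggtitle(variable1)
}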
uni_qplot('fixed.acidity',0.2)+
scale_x_continuous(breaks = 4:16)
uni_qplot('volatile.acidity',0.02)+
scale_x_continuous(breaks = seq(0,1.6,0.1))
uni_qplot('citric.acid',0.05)+
scale_x_continuous(breaks = seq(0,1,0.1))
uni_qplot('residual.sugar',0.2)+
scale_x_continuous(breaks = seq(0,10,1))+
coord_cartesian(xlim = c(0,10))
uni_qplot('chlorides',0.01)+
scale_x_continuous(breaks = seq(0,0.2,0.05))+
coord_cartesian(xlim = c(0,0.2))
uni_qplot('free.sulfur.dioxide',2)+
scale_x_continuous(breaks = seq(0,70,5))+
coord_cartesian(xlim = c(0,45))
uni_qplot('total.sulfur.dioxide',5)+
scale_x_continuous(breaks = seq(0,300,25))+
coord_cartesian(xlim = c(0,175))
uni_qplot('density',0.0005)+
scale_x_continuous(breaks = seq(0.99,1.0025,0.0025))
uni_qplot('pH', 0.02)+
scale_x_continuous(breaks = seq(0,4.5,0.1))
uni_qplot('sulphates',0.05)+
scale_x_continuous(breaks = seq(0,2,0.25))+
coord_cartesian(xlim = c(0.25,1.25))
uni_qplot('alcohol',0.5)+
scale_x_continuous(breaks = seq(8,15,1))
uni_qplot('quality',0.3)
names(WINE)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality" "quality_lv"
str(WINE)
## 'data.frame': 1599 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ quality_lv : Ord.factor w/ 3 levels "poor"<"average"<..: 2 2 2 2 2 2 2 3 3 2 ...
summary(WINE)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality quality_lv
## Min. : 8.40 Min. :3.000 poor : 63
## 1st Qu.: 9.50 1st Qu.:5.000 average:1319
## Median :10.20 Median :6.000 high : 217
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The dataset WINE has 1599 red wines and 14 columns. Besides the row index X and the derived quality_lv, there are 12 variables: 11 physicochemical features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol) and the quality score.
fixed.acidity: median 7.90, max 15.90; 75% of the red wines have a fixed.acidity of 9.20 or less.
volatile.acidity: median 0.52, max 1.58; 75% of the red wines have a volatile.acidity of 0.64 or less.
citric.acid: median 0.26, max 1.00; 75% of the red wines have a citric.acid of 0.42 or less.
residual.sugar: median 2.20, max 15.50; 75% of the red wines have a residual.sugar of 2.60 or less.
chlorides: median 0.079, max 0.611; 75% of the red wines have a chlorides value of 0.09 or less.
free.sulfur.dioxide: median 14.00, max 72.00; 75% of the red wines have a free.sulfur.dioxide of 21.00 or less.
total.sulfur.dioxide: median 38.00, max 289.00; 75% of the red wines have a total.sulfur.dioxide of 62.00 or less.
density: median 0.9968, max 1.0037; 75% of the red wines have a density of 0.9978 or less.
pH: median 3.31, max 4.01; 75% of the red wines have a pH of 3.40 or less.
sulphates: median 0.62, max 2.00; 75% of the red wines have a sulphates value of 0.73 or less.
alcohol: median 10.20, max 14.90; 75% of the red wines have an alcohol value of 11.10 or less.
quality: median 6.00, max 8.00; 75% of the red wines have a quality score of 6 or less.
The main feature of interest in the dataset is the quality of the red wine. I want to find which features of red wine are the most important in determining the quality.
At this point, it is hard to tell which features influence the quality the most. However, I suspect that ‘fixed.acidity’, ‘citric.acid’, ‘pH’, and ‘alcohol’ have more influence on the quality than the rest of the features.
I created a new variable called ‘quality_lv’ to make it easier to compare different levels of quality. I also created a new variable called ‘pH.bucket’ to make it easier to see the distribution of pH for all the red wines.
The distributions of residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide and sulphates have long right tails, so I limited the x-axis range of each histogram to show the bulk of the distribution more clearly.
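The two ways of limiting an axis used in this report behave differently, which is worth keeping in mind when reading the warnings further down. A minimal sketch, assuming ggplot2 is installed and using a toy long-tailed sample (illustrative values, not the wine data): coord_cartesian() only zooms, while xlim() or ylim() drops the observations outside the limits before any statistic is computed.

```r
library(ggplot2)

# Toy long-tailed sample standing in for a column such as residual.sugar
# (illustrative values, not taken from the wine data).
toy <- data.frame(x = c(1.2, 1.9, 2.0, 2.1, 2.2, 2.3, 2.5, 2.6, 3.0, 15.5))

# Cut-off at the 90th percentile, as plain number without the "90%" name.
upper <- unname(quantile(toy$x, 0.9))

# Zoom only: every observation still contributes to the bin counts.
p_zoom <- ggplot(toy, aes(x = x)) +
  geom_histogram(binwidth = 0.5) +
  coord_cartesian(xlim = c(0, upper))

# Scale limits: observations above `upper` are removed before binning,
# which is what produces "Removed N rows ..." warnings when the plot is drawn.
p_drop <- ggplot(toy, aes(x = x)) +
  geom_histogram(binwidth = 0.5) +
  xlim(0, upper)

# Number of observations that the second plot would discard.
n_outside <- sum(toy$x > upper)
print(n_outside)
```

For boxplots this difference matters more than for histograms, because dropping rows changes the medians and quartiles that are drawn.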
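# Boxplot of one WINE column against quality, with jittered points
# and a red line through the per-score medians.
box_dots_plot <- function(variable){
  ggplot(data=WINE,aes_q(x=~quality,y=as.name(variable)))+
    geom_boxplot()+
    geom_jitter(alpha=1/5)+
    geom_line(aes(group=1),
              stat = 'summary',
              fun.y=median,
              color='#E74C3C',
              size=1,
              alpha=0.8)
}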
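The "Continuous x aesthetic" warning that follows each call appears because quality is stored as an integer, so geom_boxplot() has no groups to split on. A minimal sketch of a variant that avoids it, assuming ggplot2 3.3.0 or later (for the `fun` argument and the `.data` pronoun); the name box_by_score and the toy data frame are hypothetical and not part of the original analysis:

```r
library(ggplot2)

# Hypothetical variant of box_dots_plot(): treating quality as a factor
# gives one box per score and removes the need for an explicit group.
box_by_score <- function(data, variable) {
  ggplot(data, aes(x = factor(quality), y = .data[[variable]])) +
    geom_boxplot() +
    geom_jitter(alpha = 1/5, width = 0.2) +
    # group = 1 joins the per-score medians into a single line
    stat_summary(aes(group = 1), fun = median, geom = "line",
                 color = "#E74C3C", alpha = 0.8) +
    xlab("quality")
}

# Toy data (illustrative values, not the wine data).
toy <- data.frame(
  quality = c(5L, 5L, 6L, 6L, 7L, 7L),
  alcohol = c(9.4, 9.8, 10.5, 11.0, 11.5, 12.8)
)
p <- box_by_score(toy, "alcohol")
```

The plots in this report were produced with the original function, so the warnings shown below belong to that version.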
box_dots_plot('fixed.acidity')
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
cor(x=WINE$quality,y=WINE$fixed.acidity)
## [1] 0.1240516
box_dots_plot('volatile.acidity')
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
cor(x=WINE$quality,y=WINE$volatile.acidity)
## [1] -0.3905578
#As the quality improves, the volatile acidity decreases, so there is a negative relationship between volatile.acidity and quality.
box_dots_plot('citric.acid')
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
group_by(WINE,quality) %>%
summarize(n_zero=sum(citric.acid==0)/n())
## # A tibble: 6 × 2
## quality n_zero
## <int> <dbl>
## 1 3 0.30000000
## 2 4 0.18867925
## 3 5 0.08370044
## 4 6 0.08463950
## 5 7 0.04020101
## 6 8 0.00000000
cor(x=WINE$quality,y=WINE$citric.acid)
## [1] 0.2263725
#As the quality improves, the share of wines with zero citric acid generally decreases. Together with the positive correlation, this points to a positive relationship between quality and citric acid.
box_dots_plot('residual.sugar')+
ylim(NA,quantile(WINE$residual.sugar,0.9))
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 156 rows containing non-finite values (stat_boxplot).
## Warning: Removed 156 rows containing non-finite values (stat_summary).
## Warning: Removed 160 rows containing missing values (geom_point).
cor(x=WINE$quality,y=WINE$residual.sugar)
## [1] 0.01373164
#almost no apparent relationship between residual sugar and quality.
box_dots_plot('chlorides')+
ylim(NA, quantile(WINE$chlorides,0.9))
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 158 rows containing non-finite values (stat_boxplot).
## Warning: Removed 158 rows containing non-finite values (stat_summary).
## Warning: Removed 159 rows containing missing values (geom_point).
cor(x=WINE$quality,y=WINE$chlorides)
## [1] -0.1289066
#Weak relationship between chlorides and quality.
box_dots_plot('free.sulfur.dioxide')+
geom_hline(yintercept = 50,color='#F1C40F',linetype=2, size=1.5)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
cor(x=WINE$quality,y=WINE$free.sulfur.dioxide)
## [1] -0.05065606
#no apparent relationship between free sulfur dioxide and quality
box_dots_plot('total.sulfur.dioxide')+
ylim(NA,200)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
## Warning: Removed 2 rows containing non-finite values (stat_summary).
## Warning: Removed 2 rows containing missing values (geom_point).
cor(x=WINE$quality,y=WINE$total.sulfur.dioxide)
## [1] -0.1851003
#Total sulfur dioxide peaks around quality 5 and 6 and decreases as the quality improves further, so the relationship with quality is negative but weak (correlation -0.19).
box_dots_plot('density')
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
cor(x=WINE$quality,y=WINE$density)
## [1] -0.1749192
#As the quality improves, density decreases gradually. Therefore, there is a negative relationship between density and quality.
box_dots_plot('pH')
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
cor(x=WINE$quality,y=WINE$pH)
## [1] -0.05773139
#No apparent relationship between pH and quality.
box_dots_plot('sulphates')+
ylim(NA,quantile(WINE$sulphates,0.9))
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 150 rows containing non-finite values (stat_boxplot).
## Warning: Removed 150 rows containing non-finite values (stat_summary).
## Warning: Removed 157 rows containing missing values (geom_point).
cor(x=WINE$quality,y=WINE$sulphates)
## [1] 0.2513971
#As the quality improves, sulphates also increase. Therefore, there is a positive relationship between sulphates and quality.
box_dots_plot('alcohol')+
xlab('Quality Level')+
ylab('Alcohol')
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
cor(x=WINE$quality,y=WINE$alcohol)
## [1] 0.4761663
#Alcohol has the strongest correlation with quality of all the features examined.
#As quality improves, alcohol concentration increases too.
cor(x=WINE$density,y=WINE$alcohol)
## [1] -0.4961798
#Density is closely related to alcohol.
#Lower density goes with higher alcohol concentration, so lower density also tends to go with higher quality (negative relationship).
cor(x=WINE$residual.sugar,y=WINE$density)
## [1] 0.3552834
#Besides alcohol, sugar content is also related to density: higher residual sugar goes with higher density, which is a positive relationship. This suggests that residual sugar could relate negatively to alcohol and to quality, but the direct correlation between residual sugar and quality (0.014) shows essentially no linear relationship.
#Use boxplots to check the statements above.
re_plot <- function(variable1, variable2){
  ggplot(data=WINE,aes_q(x=as.name(variable1),y=as.name(variable2)))+
    geom_boxplot()+
    geom_jitter(alpha=1/5)+
    geom_line(aes(group=1),
              stat = 'summary',
              fun.y=median,
              color='#E74C3C',
              size=1,
              alpha=0.8)
}
re_plot('quality_lv','alcohol')+
xlab('Quality Level')+
ylab('Alcohol')
#Alcohol and quality have a positive relationship.
re_plot('alcohol','density')+
xlim(NA,quantile(WINE$alcohol,0.9))+
xlab('Alcohol')+
ylab('Density')
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
## Warning: Removed 141 rows containing non-finite values (stat_boxplot).
## Warning: Removed 141 rows containing non-finite values (stat_summary).
## Warning: Removed 149 rows containing missing values (geom_point).
#Density and alcohol have a negative relationship.
re_plot('quality_lv','density')+
xlab('Quality')+
ylab('Density')
#Density and quality have a negative relationship.
re_plot('residual.sugar','density')+
xlab('Sugar')+
ylab('Density')
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
#Density and residual sugar have a positive relationship.
The main feature I am interested in here is the wine quality. The variables with the clearest relationships to wine quality are:
1. volatile.acidity: as the quality improves, the volatile acidity decreases, so there is a negative relationship between volatile.acidity and quality (correlation -0.39).
2. total sulfur dioxide: it peaks around quality 5 and 6 and decreases as the quality improves further, so there is a weak negative relationship between total sulfur dioxide and quality (correlation -0.19).
3. density: as the quality improves, density decreases gradually, so there is a weak negative relationship between density and quality (correlation -0.17).
4. sulphates: as the quality improves, sulphates also increase, so there is a positive relationship between sulphates and quality (correlation 0.25).
5. alcohol: alcohol has the strongest correlation with quality (0.48). As quality improves, alcohol concentration increases too, so they have a positive relationship.
1. Density is closely related to alcohol (correlation -0.50). Lower density goes with higher alcohol concentration, so lower density also tends to go with higher quality (negative relationship).
2. Besides alcohol, sugar content is also related to density (correlation 0.36): higher residual sugar goes with higher density, which is a positive relationship. This suggests that residual sugar could relate negatively to alcohol and to quality, but the direct correlation between residual sugar and quality (0.014) shows essentially no linear relationship.
The strongest relationship with wine quality that I found is with alcohol concentration, with a correlation of 0.4761663: wines with a higher alcohol concentration tend to have a higher quality.
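The feature-by-feature cor() calls above can also be collected in one step. A minimal sketch, assuming a hypothetical helper named rank_by_cor (not part of the original analysis) and demonstrated on R's built-in mtcars data so that it runs without the wine file:

```r
# Hypothetical helper: Pearson correlation of every numeric column with a
# target column, sorted by absolute strength (strongest first).
rank_by_cor <- function(data, target) {
  numeric_cols <- names(data)[vapply(data, is.numeric, logical(1))]
  predictors <- setdiff(numeric_cols, target)
  cors <- vapply(predictors,
                 function(col) cor(data[[col]], data[[target]]),
                 numeric(1))
  cors[order(abs(cors), decreasing = TRUE)]
}

# Demonstration on the built-in mtcars data set.
ranked <- rank_by_cor(mtcars, "mpg")
print(round(ranked, 3))
```

For the wine data the call would be rank_by_cor(WINE[, setdiff(names(WINE), "X")], "quality"); the factor column quality_lv is skipped automatically because it is not numeric.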
ggplot(aes(x=volatile.acidity, y=alcohol, color=quality_lv),data=WINE)+
geom_point()+
facet_wrap(~quality_lv,ncol=3)
cor(x=WINE$volatile.acidity,y=WINE$quality)
## [1] -0.3905578
ggplot(aes(x=volatile.acidity,color=quality_lv),data=WINE)+
geom_density()+
theme_classic()
#Higher-quality wines have lower volatile acidity: the density curve peaks around 0.4 for high-quality wines, around 0.5 for average-quality wines, and around 0.7 for poor-quality wines.
ggplot(aes(x=fixed.acidity, y=alcohol, color=quality_lv),data=WINE)+
geom_point()+
facet_wrap(~quality_lv,ncol=3)
cor(x=WINE$fixed.acidity,y=WINE$quality)
## [1] 0.1240516
ggplot(aes(x=fixed.acidity, color=quality_lv),data=WINE)+
geom_density()+
theme_classic()
#Higher-quality wines tend to have somewhat higher fixed acidity, although the correlation is weak (0.12). Poor- and average-quality wines mostly have a fixed acidity around 6-7, while high-quality wines mostly have one around 9-10.
ggplot(aes(x=volatile.acidity, y=fixed.acidity, color=quality_lv),data=WINE)+
geom_point()+
facet_wrap(~quality_lv,ncol=3)
cor(x=WINE$volatile.acidity,y=WINE$fixed.acidity)
## [1] -0.2561309
#The graph above is consistent with the first two points: wine quality has a positive relationship with fixed acidity and a negative relationship with volatile acidity.
ggplot(aes(x=citric.acid, y=alcohol, color=quality_lv),data=WINE)+
geom_point()+
facet_wrap(~quality_lv,ncol=3)
cor.test(x=WINE$citric.acid, y=WINE$quality)
##
## Pearson's product-moment correlation
##
## data: WINE$citric.acid and WINE$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
ggplot(aes(x=citric.acid, color=quality_lv),data=WINE)+
geom_density()+
theme_classic()
#Higher-quality wines tend to have higher levels of citric acid.
ggplot(aes(x=residual.sugar, y=alcohol, color=quality_lv),data=WINE)+
geom_point()+
facet_wrap(~quality_lv,ncol=3)
cor(x=WINE$residual.sugar,y=WINE$quality)
## [1] 0.01373164
ggplot(aes(x=residual.sugar, color=quality_lv),data=WINE)+
geom_density()+
theme_classic()
#There is almost no linear correlation between the quality and the residual sugar level.
ggplot(aes(x=chlorides, y=alcohol, color=quality_lv),data=WINE)+
geom_point()+
facet_wrap(~quality_lv,ncol=3)
cor(x=WINE$chlorides,y=WINE$quality)
## [1] -0.1289066
ggplot(aes(x=chlorides,color=quality_lv),data=WINE)+
geom_density()+
theme_classic()
#In general, higher-quality wines have lower levels of chlorides, although the correlation is very weak.
ggplot(aes(x=free.sulfur.dioxide, y=alcohol,color=quality_lv),data=WINE)+
geom_point()+
facet_wrap(~quality_lv,ncol=3)
cor(x=WINE$free.sulfur.dioxide,y=WINE$quality)
## [1] -0.05065606
ggplot(aes(x=free.sulfur.dioxide, color=quality_lv),data=WINE)+
geom_density()+
theme_classic()
#There is almost no linear correlation between the quality and the concentration of free sulfur dioxide.
ggplot(aes(x=total.sulfur.dioxide,y=alcohol,color=quality_lv),data=WINE)+
geom_point()+
facet_wrap(~quality_lv,ncol=3)
cor(x=WINE$total.sulfur.dioxide,y=WINE$quality)
## [1] -0.1851003
ggplot(aes(x=total.sulfur.dioxide,color=quality_lv),data=WINE)+
geom_density()+
theme_classic()
#In general, higher-quality wines have lower levels of total sulfur dioxide, although the correlation is weak.
ggplot(aes(x=density,y=alcohol,color=quality_lv),data=WINE)+
geom_point()+
facet_wrap(~quality_lv,ncol = 3)
cor(x=WINE$density,y=WINE$quality)
## [1] -0.1749192
ggplot(aes(x=density,color=quality_lv),data=WINE)+
geom_density()+
theme_classic()
#High-quality wines generally have lower density than the other two groups. The difference between poor- and average-quality wines is not clear.
ggplot(aes(x=pH,y=alcohol,color=quality_lv),data=WINE)+
geom_point()+
facet_wrap(~quality_lv,ncol = 3)
cor(x=WINE$pH,y=WINE$quality)
## [1] -0.05773139
ggplot(aes(x=pH, color=quality_lv),data=WINE)+
geom_density()+
theme_classic()
#There is almost no linear correlation between pH and the quality of wine.
ggplot(aes(x=sulphates,y=alcohol,color=quality_lv),data=WINE)+
geom_point()+
facet_wrap(~quality_lv,ncol=3)
cor(x=WINE$sulphates,y=WINE$quality)
## [1] 0.2513971
ggplot(aes(x=sulphates, color=quality_lv),data=WINE)+
geom_density()+
theme_classic()
#In general, higher-quality wines have higher sulphate concentrations.
ggplot(aes(x=fixed.acidity+volatile.acidity+citric.acid,y=pH),
data=WINE)+
geom_point(alpha=0.2)+
geom_smooth(method = 'loess', color='red')
cor(x=WINE$fixed.acidity+WINE$volatile.acidity+WINE$citric.acid,y=WINE$pH)
## [1] -0.6834838
#As the graph above shows, pH gets lower as the overall acid concentration gets higher.
Higher amounts of citric acid and sulphates, along with higher alcohol concentration, go with better-quality wines.
Lower volatile acidity along with higher alcohol concentration goes with better-quality wines.
pH is closely tied to the combined acid content: the sum of fixed acidity, volatile acidity, and citric acid has a correlation of -0.683 with pH.
In the previous analysis of which feature is most closely related to the quality of wine, alcohol stands out with the strongest positive correlation with quality. The boxplot above shows that wines with a higher alcohol level tend to be rated higher.
Besides alcohol, volatile acidity is the second most influential feature for the quality of wine. The previous analysis and the boxplot above show that a higher level of volatile acidity is associated with a lower quality rating.
The plot above shows the combined effect of alcohol and volatile acidity on the quality of wine: wines with a higher level of alcohol and a lower level of volatile acidity generally have better quality, and wines with a lower level of alcohol and a higher level of volatile acidity mostly have lower quality ratings.
The red wine dataset has 1599 samples and 11 features. To analyze which features are important to the quality of wines, I built 3 linear models to explore those features. Coloring the wines by quality level in the multivariate plots made it much clearer and easier to see what kind of influence each feature has on the quality of wines.
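The three linear models themselves are not shown in this report. A minimal sketch of what such a comparison could look like, assuming three nested models built from the features that stood out above; the choice of predictors and the toy data frame are assumptions for illustration, not the original models or the wine data:

```r
# Toy data frame with the same column names as WINE
# (illustrative values, not the wine data).
toy <- data.frame(
  alcohol          = c(9.4, 9.8, 10.0, 10.5, 11.0, 11.5, 12.0, 12.8),
  volatile.acidity = c(0.70, 0.88, 0.60, 0.65, 0.50, 0.40, 0.45, 0.30),
  sulphates        = c(0.56, 0.68, 0.46, 0.47, 0.57, 0.80, 0.75, 0.70),
  quality          = c(5, 5, 5, 6, 6, 7, 6, 8)
)

# Three nested models, adding one predictor at a time (assumed structure).
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + sulphates)

# R-squared of each model; for nested least-squares fits it cannot decrease.
r2 <- vapply(list(m1, m2, m3),
             function(m) summary(m)$r.squared,
             numeric(1))
print(round(r2, 3))
```

On the real data, the same code would use data = WINE, and the gain in R-squared from one model to the next gives a rough idea of how much each added feature contributes.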
From the analysis, we can see that alcohol and volatile acidity are the two most important features for the wine’s quality. Quality has a positive relationship with alcohol and a negative relationship with volatile acidity.
The analysis would be more reliable if the dataset included more wines of poor or high quality: only 63 wines fall in the poor level and 217 in the high level, against 1319 in the average level.